Research Findings

1 Research Findings: Geometric Clustering and Its Applications


In a series of papers with collaborators [14, 13, 12, 2, 10, 11, 8, 9, 18], I was among the pioneers
in solving geometric clustering problems using approximation algorithms. In [10, 11], we showed the
existence of O(1/ε)-size core-sets for the minimum enclosing ball (MEB) problem. For the minimum volume
ellipsoid (MVE) problem, we have shown the existence of O(d²/ε)-size core-sets for (1 + ε)-approximation
of the volume and 0( d>ogd'j size core-sets for (1 + ε)-approximation of the radii [12]. Our algorithms for
both  of  these  problems  are  among  the  best  available,  both  in  terms  of  theoretical  guarantees  and  practical 
results.  Below  I  summarize  recent  findings  that  have  been  enabled  by  the  AFOSR  YIP  Grant  on  various 
problems. 
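
To make the core-set idea concrete, the following is a minimal sketch of the simple iterative MEB scheme of Bădoiu and Clarkson, which repeatedly moves the center toward the current farthest point. It is a stand-in for illustration, not the O(1/ε) core-set algorithm of [10, 11]; the function name approx_meb and its return format are assumptions made here. The farthest points visited form a small subset of the input that drives the solution.

```python
import math


def approx_meb(points, eps):
    """Approximate the minimum enclosing ball of `points`.

    Illustrative sketch (Badoiu-Clarkson style iteration), not the
    algorithm of [10, 11]. Runs ceil(1/eps^2) iterations and returns
    (center, radius, indices of the farthest points visited).
    """
    center = list(points[0])
    core = {0}
    iterations = math.ceil(1.0 / (eps * eps))
    for i in range(1, iterations + 1):
        # Index of the point farthest from the current center.
        far = max(range(len(points)),
                  key=lambda j: math.dist(center, points[j]))
        core.add(far)
        # Move the center a shrinking fraction of the way toward it.
        step = 1.0 / (i + 1)
        center = [c + step * (p - c) for c, p in zip(center, points[far])]
    radius = max(math.dist(center, p) for p in points)
    return center, radius, sorted(core)


if __name__ == "__main__":
    square = [(0.0, 0.0), (2.0, 0.0), (0.0, 2.0), (2.0, 2.0)]
    c, r, core = approx_meb(square, 0.1)
    print(c, r, core)
```

The returned radius is always at least the optimal radius, and after 1/ε² iterations it is at most (1 + ε) times the optimum.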


2  Support  Vector  Machines 

We have recently improved the best core-set algorithm [6] known for SVM problems [14]. We observed that
one can remove the assumption of an exact SVM solver from the design of a core-set based algorithm for
the SVM problem. Our algorithm outputs an optimal number of support vectors (within a constant factor)
and exhibits linear convergence. We also show implementation results and compare our implementation with
other major first-order algorithms available. The algorithm is big-data friendly, cache oblivious, easy to
parallelize, and very simple to implement.


3  New  Approximate  Nearest  Neighbor  Search  Algorithm 

We have implemented a fast, dynamic, approximate nearest-neighbor search algorithm that works well in
fixed dimensions (d < 5), based on sorting points (with integral coordinates) in Morton (or Z-) order [5].
Our code scales well on multi-core/multi-CPU shared-memory systems. Our implementation is competitive with
the best approximate nearest-neighbor search codes available on the web [17], especially for creating
approximate k-nearest-neighbor graphs of a point cloud. This is joint work with my graduate student
Michael Connor [4].
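
As an illustration of the ordering step only, the sketch below sorts points with non-negative integer coordinates in Morton order by interleaving coordinate bits. The helper names morton_key and morton_sort and the 21-bit default are assumptions for this example; the actual implementation may compare points without building explicit keys, and the nearest-neighbor search itself is not shown.

```python
def morton_key(point, bits=21):
    """Interleave the low `bits` bits of each non-negative integer coordinate.

    Bit b of coordinate `axis` lands at position b * d + axis of the key,
    where d is the dimension, so points close on the Z-curve get nearby keys.
    """
    d = len(point)
    key = 0
    for b in range(bits):
        for axis, coord in enumerate(point):
            key |= ((coord >> b) & 1) << (b * d + axis)
    return key


def morton_sort(points, bits=21):
    """Return the points sorted along the Morton (Z-order) curve."""
    return sorted(points, key=lambda p: morton_key(p, bits))


if __name__ == "__main__":
    pts = [(1, 1), (0, 1), (1, 0), (0, 0)]
    print(morton_sort(pts))
```

Once the points are sorted this way, candidates for the nearest neighbors of a point can be found by scanning a window around its position in the sorted order.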

An application of approximate nearest-neighbor search is the computation of group enclosing queries [15].
Given a set of points P and a query set Q, a group enclosing query (GEQ) fetches the point p* ∈ P such that
the maximum distance from p* to the points in Q is minimized. For instance, given a large spatial database
of points of interest, such as restaurants or resorts, a group of people looking for a place to meet
such that the longest distance traveled by anyone in the group is minimized poses a GEQ. The
paper [15] presents the challenges associated with such a query and proposes efficient, R-tree based algorithms
for GEQ. If an exact answer is not critical, we present a simple and practical √2-approximation algorithm
and extend it to retrieve (1 + ε)-approximate solutions for GEQ. Furthermore, our study of GEQ reveals
its close relationship with the bichromatic reverse furthest neighbors problem (RFN), for which only limited
theoretical treatment exists. As a by-product, we present the first R-tree based algorithm for RFN. Our
algorithms do not assume that either P or Q fits in main memory. Experiments on both synthetic and
real data sets confirm the superior efficiency and scalability of the proposed algorithms over the naive
brute-force search. We also made progress on k-nearest-neighbor joins in large relational
databases [19].
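
For reference, the query definition above corresponds to the following brute-force baseline, which scans all of P and all of Q. This is the naive approach that the R-tree based algorithms are compared against, not the proposed method; the function name group_enclosing_query is an assumption for this sketch, and both sets are assumed to fit in memory.

```python
import math


def group_enclosing_query(P, Q):
    """Return (p_star, cost): the point of P minimizing the maximum
    Euclidean distance to the points of Q, and that maximum distance.

    Brute-force O(|P| * |Q|) baseline for illustration only.
    """
    def cost(p):
        return max(math.dist(p, q) for q in Q)

    best = min(P, key=cost)
    return best, cost(best)


if __name__ == "__main__":
    places = [(0, 0), (5, 5), (10, 0)]   # candidate meeting points
    group = [(4, 4), (6, 6), (4, 6)]     # locations of the people
    print(group_enclosing_query(places, group))
```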
3.1 Surface Reconstruction


In this part of the project, we use the implementation for finding fast approximate dynamic nearest
neighbors to implement a fast, out-of-core, streaming, parallel surface reconstruction algorithm. We have been
experimenting with non-linear dimensionality reduction methods for use in surface reconstruction
applications. We also show how our algorithm scales on multi-processor/multi-core systems. This is joint work with my
graduate student, James McClain.

On  a  related  subject,  we  made  progress  on  accurate  localization  of  RFID  tags  in  three  dimensions  [7]. 


4  Clustering  on  Road  Networks 

We studied the 1-center problem on road networks, an important problem in GIS. Using Euclidean
embeddings and a reduction to fast nearest-neighbor search, we devised an approximation algorithm for this problem.
Our initial experiments on real-world data sets indicate fast computation of constant-factor approximate
solutions for query sets much larger than previously computable using exact techniques [3, 16].


5  Bichromatic  2-Center  of  Pairs  of  Points 

This study is motivated by a facility location problem in transportation system design, in which we are given
origin/destination pairs of points for desired travel, and our goal is to locate an optimal road/flight segment
in order to minimize the travel to/from the endpoints of the segment. We considered several variants of
this problem under different metrics and developed efficient algorithms, both approximate and exact [1].
Most of the linear or near-linear time algorithms found during this research were
inspired by the core-set approach.
A. PROJECT ACTIVITIES


1  Project  Summary 

The primary objective of this YIP proposal is the design, analysis, and implementation of efficient algorithms
for geometric clustering problems and their applications. Core-sets are ideal for many such designs. Using
this paradigm, one computes a small but 'most relevant' subset of the input and solves the optimization problem
on this small subset, thereby obtaining an approximate solution to the original problem with proven accuracy
and efficiency. The main focus is on the design, analysis, and implementation of efficient algorithms
for problems arising in computational geometry and discrete optimization.

The educational component of this project includes reworking the algorithms and computational geometry
courses, disseminating teaching material over the web, directing undergraduate and graduate students, and
integrating software development into education.


2 Activities

Research  activities  have  been  undertaken  for  each  component  of  this  project.  We  report  the  advances  we  have 
made  in  the  following  subsections  (in  terms  of  observations,  simulations,  experiments  and  presentations). 

2.1  Geometric  Clustering  on  Maps 

One of the applications we worked on is the computation of centers for Geographic Information System
applications. We built algorithms and software that compute meeting points for multiple
people/addresses accurately and in real time. Work on this project has begun and a very early prototype
can be found at: http://maps.compgeom.com. We also published this work at ACM SIGSPATIAL 2011. This
work was later extended to allow multiple center computations in real time. This work also won the NSF
ICorps Award in 2012.

2.2  Approximate  nearest  neighbor  search 

In this part of the project, we have implemented and experimented with a fast dynamic approximate nearest
neighbor search algorithm that works well in fixed dimensions (d < 5). Our algorithm scales well on
multi-core/multi-CPU shared-memory systems. Preliminary experiments showed that our method is competitive
with the best approximate nearest neighbor searching codes available on the web (ANN by David Mount).
A preliminary software release related to this project can be found at: http://compgeom.com/~stann

Applications to Spatial Databases: Finding the k nearest neighbors (kNN) of a query point, or of a set
of query points (kNN-Join), is a fundamental problem in many database applications in the Air Force. Most
of  the  previous  efforts  to  solve  these  problems  focused  on  spatial  databases  or  stand-alone  systems,  where 
changes  to  the  database  engine  are  required,  which  greatly  limits  their  application  on  large  data  sets  that 
are  stored  in  a  relational  database  management  system.  Furthermore,  these  methods  cannot  automatically 
optimize  kNN  queries  or  kNN-Joins  when  additional  query  conditions  are  specified.  In  this  work,  we  study 
both  the  kNN  query  and  the  kNN-Join  in  a  relational  database,  possibly  augmented  with  additional  query 
conditions. We search for relational algorithms that require no changes to the database engine. The
straightforward solution uses a user-defined function (UDF), which a query optimizer cannot optimize. We
design algorithms that can be implemented with SQL operators without changes to the database engine,
hence enabling the query optimizer to understand the query and generate the best query plan. Extensive experiments
on  large,  real  and  synthetic,  data  sets  confirm  the  superior  efficiency  and  practicality  of  our  approach, 
compared  to  the  state  of  the  art. 
Applications to Geometric Minimum Spanning Tree: We have recently devised an algorithm that
can compute geometric minimum spanning trees (GMSTs) on multi-core machines. We call it GeoFilterKruskal, an algorithm that computes the
minimum spanning tree of a set of points P using a well-separated pair decomposition in combination with a
simple modification of Kruskal's algorithm. When P is sampled from a uniform random distribution, we show
that our algorithm runs in O(n log² n) time with high probability. Experiments show that our algorithm
works better in practice for most data distributions compared to the current state of the art. Our algorithm
is easy to parallelize and, to our knowledge, is currently the best practical algorithm on multi-core machines
for dimensions greater than two. The GMST algorithm was released to the public as open-source code
as part of the STANN library.
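
To fix ideas about the problem being solved, here is a plain Kruskal baseline on the complete Euclidean graph with a union-find structure. It is a quadratic-size reference computation only: it does not use a well-separated pair decomposition or any filtering, and it is not GeoFilterKruskal or the STANN implementation. The function name euclidean_mst is an assumption for this sketch.

```python
import itertools
import math


def euclidean_mst(points):
    """Return the MST edges (i, j, weight) of the complete Euclidean graph.

    Baseline Kruskal: sorts all n*(n-1)/2 candidate edges, which is the
    cost that decomposition-based methods are designed to avoid.
    """
    n = len(points)
    parent = list(range(n))

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    edges = sorted(
        (math.dist(points[i], points[j]), i, j)
        for i, j in itertools.combinations(range(n), 2)
    )
    tree = []
    for w, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:              # edge joins two different components
            parent[ri] = rj
            tree.append((i, j, w))
            if len(tree) == n - 1:
                break
    return tree


if __name__ == "__main__":
    pts = [(0, 0), (1, 0), (1, 1), (5, 5)]
    print(euclidean_mst(pts))
```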

Fast Exact Nearest Neighbors in 2D: In 2010, we showed that under some very simple practical
assumptions, one can design an algorithm that finds the nearest neighbor of a given query point in O(log n)
time in theory and faster than the state of the art in practice. The algorithm and proof are both simple,
and the experimental results clearly show that we can beat the state of the art on most distributions in two
dimensions.


2.3  Surface  Reconstruction 

In this part of the project, we use the implementation for finding fast approximate dynamic nearest neighbors
to implement a fast, out-of-core, streaming, parallel surface reconstruction algorithm. We have also
been experimenting with non-linear dimensionality reduction methods for use in surface reconstruction
applications. We have begun to get results for non-smooth surfaces in this area and hope to report our
results soon at leading conferences.

A  related  endeavor  we  embarked  on  is  accurate  localization  of  RFID  tags  in  three  dimensions.  We  have  also 
made  progress  on  this  problem. 

2.4  Core-Sets:  Support  Vector  Machines 

We have recently improved the best core-set algorithm known for SVM problems. We observed that one can
remove the assumption of an exact SVM solver from the design of a core-set based algorithm for
the SVM problem. We are currently in the process of implementing our algorithm.

2.5  Bichromatic  2-Center  clustering  for  pairs  of  points 

We study a class of geometric optimization problems closely related to the 2-center problem: Given a set
S of n pairs of points, assign to each point a color (red or blue) so that the two points of each pair are assigned
different colors and a function of the radii of the minimum enclosing balls of the red points and the blue
points, respectively, is optimized. In particular, we consider the problems of minimizing the maximum and
minimizing the sum of the two radii. For each case, minmax and minsum, we consider distances measured in
the L2 and in the L∞ metrics. Our problems are motivated by a facility location problem in transportation
system design, in which we are given origin/destination pairs of points for desired travel, and our goal is to
locate an optimal road/flight segment in order to minimize the travel to/from the endpoints of the segment.
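
The problem statement can be checked on tiny inputs with the exhaustive sketch below, which tries all 2^n colorings under the L∞ metric, where the minimum enclosing ball is an axis-aligned cube whose radius is half the largest coordinate extent. It is exponential in n and serves only to illustrate the objective; it is not one of the efficient algorithms from this research. The names linf_radius and bichromatic_2center are assumptions for this example.

```python
import itertools


def linf_radius(points):
    """Radius of the smallest L-infinity ball (axis-aligned cube) enclosing points."""
    dims = range(len(points[0]))
    return max(
        max(p[k] for p in points) - min(p[k] for p in points) for k in dims
    ) / 2.0


def bichromatic_2center(pairs, objective=max):
    """Exhaustively color each pair (one point red, one blue).

    `objective` combines the two radii: max for minmax, sum for minsum.
    Returns (best value, red points, blue points).
    """
    best = None
    for flips in itertools.product((False, True), repeat=len(pairs)):
        red = [b if f else a for (a, b), f in zip(pairs, flips)]
        blue = [a if f else b for (a, b), f in zip(pairs, flips)]
        value = objective((linf_radius(red), linf_radius(blue)))
        if best is None or value < best[0]:
            best = (value, red, blue)
    return best


if __name__ == "__main__":
    trips = [((0, 0), (10, 0)), ((11, 1), (1, 1)), ((0, 2), (10, 2))]
    print(bichromatic_2center(trips, max))
    print(bichromatic_2center(trips, sum))
```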

2.6  Education 

The  PI  is  continuously  integrating  research  findings  from  this  project  into  his  graduate  courses.  The  PI  has 
completely  reworked  and  developed  a  new  curriculum  for  his  graduate  course  on  computational  geometry. 
The  homework  assignments  in  this  course  involve  computational  projects  that  require  each  student  to 
formulate  and  code  geometric  optimization  problems.  This  experience  has  been  extremely  valuable  as  many 
graduate  students  continued  to  use  core-set  techniques  in  their  own  research. 
Since the PI's interests are in both theory and implementation, he recently moved his group to the Python
programming language and has designed a new course on the subject so that not only his group but also
students across FSU can benefit. The Python course that he designed covers not only the
basics of Python but also more advanced material, such as metaprogramming and decorators, in detail. The PI
believes that in the near future this will help students get things done much faster than students coding
in C++.

The AFOSR funding helped two PhD students graduate and find good positions. Michael Connor now works
for the NSA, and Samidh Chatterjee works at a startup named AirSage that processes large volumes of mobile path data.
Two more PhD students have been supported by this funding and are close to graduation.

2.7  Presentations 

The PI and his students have made the following presentations during the course of the project to date (starting
March 2010):

• K Nearest Neighbor Queries and KNN-Joins in Large Relational Databases (Almost) for Free. At 26th
IEEE International Conference on Data Engineering. Long Beach, California, USA, March 2010.

• Nearest neighbors in the Plane. At Brooklyn Polytechnic, NY. April 2010.

•  Accurate  Localization  of  RFID  Tags  Using  Phase  Difference.  At  IEEE  RFID,  Orlando,  FL.  April  2010. 

•  Nearest  neighbors  in  the  Plane.  At  Symposium  on  Experimental  Algorithmics,  Napoli,  Italy.  May 
2010. 

•  Geometric  Minimum  Spanning  Trees  with  GeoFilterKruskal.  At  Symposium  on  Experimental 
Algorithmics,  Napoli,  Italy.  May  2010. 

•  Core-sets  and  its  applications.  REEF,  Destin,  FL.  Aug  2010. 

•  Minimum  Error  Rate  Training  by  Sampling  the  Translation  Lattice.  At  2010  Conference  on  Empirical 
Methods  in  Natural  Language  Processing  (EMNLP  2010).  Boston,  MA.  October  2010. 

•  Instant  Approximate  1-Center  on  Road  Networks  Via  Embeddings.  At  ACM  SIGSPATIAL  GIS,  2011. 

• Instant Approximate 1-Center on Road Networks Via Embeddings. At TAMU, College Station, TX,
October 2011.

•  Clustering  Large  Data  sets.  At  BP  Research,  Houston,  TX,  Aug  2011. 

• Fast k-clustering Queries on Road Networks. At COM.Geo 2012, Reston, VA, 2012.

•  Bichromatic  2-Center  of  Pairs  of  Points.  At  LATIN  2012,  Arequipa,  Peru,  2012. 

• Fast Multiple Center Map Queries. At NSF ICorps, Ann Arbor, 2012.